Questions

How can I add variables to my data? How can I alter the variables already in my data?

Objectives

To be able to add new variables to the data

To understand the basic consepts between different data types

Short recap of previous sessions

Welcome back to the second day! Yesterday we covered how to use ggplot to create publication quality visualisations, and dplyr to subset our data. Let’s just look at a couple of exmples of what we did.

ggplot2 1

library(tidyverse)
penguins <- palmerpenguins::penguins

penguins %>% 
  ggplot(aes(x = bill_length_mm,
             y = bill_depth_mm,
             colour = species)) +
  geom_jitter() +
  geom_smooth(method = "lm") + 
  geom_smooth(method = "lm", colour = "black")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

ggplot2 2

penguins %>% 
  ggplot(aes(x = bill_length_mm,
             y = bill_depth_mm,
             colour = species)) +
 geom_density2d() +
 labs(
   title = "Penguin bill length and depth association",
   subtitle = "of three species of penguin",
   color = "Species", 
   y = "Bill depth (mm)",
   x = "Bill length (mm)",
   caption = "Source: DOI 10.1371/journal.pone.0090081"
 )
## Warning: Removed 2 rows containing non-finite values (stat_density2d).

subsetting

penguins %>% 
  select(species, island, ends_with("mm")) %>% 
  filter(species == "Adelie")
## # A tibble: 152 x 5
##    species island    bill_length_mm bill_depth_mm flipper_length_mm
##    <fct>   <fct>              <dbl>         <dbl>             <int>
##  1 Adelie  Torgersen           39.1          18.7               181
##  2 Adelie  Torgersen           39.5          17.4               186
##  3 Adelie  Torgersen           40.3          18                 195
##  4 Adelie  Torgersen           NA            NA                  NA
##  5 Adelie  Torgersen           36.7          19.3               193
##  6 Adelie  Torgersen           39.3          20.6               190
##  7 Adelie  Torgersen           38.9          17.8               181
##  8 Adelie  Torgersen           39.2          19.6               195
##  9 Adelie  Torgersen           34.1          18.1               193
## 10 Adelie  Torgersen           42            20.2               190
## # … with 142 more rows

The dplyr package

Often, the data we have do not contain exactly what we need. We might need to change the order of factors, create new variables based on other columns in the data, or even variables conditional on specific values in other columns. Perhaps we even want to alter all columns of a specific type in the same way?

These are the things we will cover in this session.

Yesterday we went through functions in the dplyr package that are about subsetting and ordering the data in a data set.

This session will focus on the mutate function, which is dplyr’s function to create or alter variables in a data set.

  1. select() (covered in Day 1 session)
  2. filter() (covered in Day 1 session)
  3. arrange() (covered in Day 1 session)
  4. mutate() (covered in this session)
  5. group_by() (covered in this session)
  6. summarize() (covered in Day 3 session)

Adding new variables pt.1

Let us first talk about selecting columns. In dplyr, the function name for selecting columns is select()! That should be easy to remember. In fact, the tidyverse names for functions are inspired by English grammar, which will help us when we are writing our code.

As yesterday, we need to start off by making sure we have the tidyverse package loaded, and the penguins dataset ready at hand.

library(tidyverse)
penguins <- palmerpenguins::penguins

In tidyverse, when we add new variables, we use the mutate function. Just like the other tidyverse functions, mutate work specifically with data sets, and provides a nice shorthand for working directly with the columns in the data set.

penguins %>% 
  mutate(new_var = 1)
## # A tibble: 344 x 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
## #   new_var <dbl>

The output of this can be hard to spot, depending on the size of the screen. But you should be able to spot a new column in the data set called “new_var”, and it has the value 1 for all rows!

This is what we told mutate to do! We specified a new column by name, and gave it a specific value 1.

This works because its easy to assing a single value to all rows. What if we try to give it three values? What would we expect?

penguins %>% 
  mutate(var = 1:3)
Error: Problem with `mutate()` input `var`. x Input `var` can't be recycled to size 344. ℹ Input `var` is `1:3`. ℹ Input `var` must be size 344 or 1, not 3.

Here, its failing with a mysterious message. The error is telling us that input must be of size 344 or 1. 344 are the number of rows in the data set, so its telling us the input we gave it is not suitable because its neither of length 344 nor of length 1.

So now we know the premises for mutate, it takes inputs that are either of the same length as there are rows in the data set or length 1.

penguins %>% 
  mutate(var = 1:344)
## # A tibble: 344 x 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>, var <int>

But generally, we create new columns based on other data in the data set. So lets do a more useful example. For instance, perhaps we want to use the ratio between the bill length and depth as a measurement for a model.

penguins %>% 
  mutate(bill_ratio = bill_length_mm/bill_depth_mm) %>% 
  select(starts_with("bill"))
## # A tibble: 344 x 3
##    bill_length_mm bill_depth_mm bill_ratio
##             <dbl>         <dbl>      <dbl>
##  1           39.1          18.7       2.09
##  2           39.5          17.4       2.27
##  3           40.3          18         2.24
##  4           NA            NA        NA   
##  5           36.7          19.3       1.90
##  6           39.3          20.6       1.91
##  7           38.9          17.8       2.19
##  8           39.2          19.6       2   
##  9           34.1          18.1       1.88
## 10           42            20.2       2.08
## # … with 334 more rows

So, here we have asked for the ratio between bill length and depth to be calculated and stored in a column named “bill_ratio”. Then we selected just the bill columns to have a peak at the output more directly.

We can do almost anything within a mutate to get the values as we want them, also use functions that exist in R to transform the data. For instance, perhaps we want to scale the variables of interest to have a mean of 0 and standard deviation of 1, which is quite common to improve statistical modelling. We can do that with the scale function.

penguins %>% 
  mutate(
    bill_ratio = bill_length_mm/bill_depth_mm,
    bill_length_mm_z = scale(bill_length_mm)
  ) %>% 
  select(starts_with("bill"))
## # A tibble: 344 x 4
##    bill_length_mm bill_depth_mm bill_ratio bill_length_mm_z[,1]
##             <dbl>         <dbl>      <dbl>                <dbl>
##  1           39.1          18.7       2.09               -0.883
##  2           39.5          17.4       2.27               -0.810
##  3           40.3          18         2.24               -0.663
##  4           NA            NA        NA                  NA    
##  5           36.7          19.3       1.90               -1.32 
##  6           39.3          20.6       1.91               -0.847
##  7           38.9          17.8       2.19               -0.920
##  8           39.2          19.6       2                  -0.865
##  9           34.1          18.1       1.88               -1.80 
## 10           42            20.2       2.08               -0.352
## # … with 334 more rows

Challenge 1.

You can run assignments in your own RStudio, or run the first challenge in the mutating tutorial by entering the following in the R console:

learnr::run_tutorial("003-mutating", "swc.tidyverse")

(helpers, please paste this into the chat at the right time.)

Assignment

1a: Create a column named bill_ld_ratio that is the value of bill_length_mm divided by bill_depth_mm

1b: transform the body mass column to the logarithmic scale using the log function

1c: Divide all values in the flipper lenght column with 10, and store it in a variable called flipper_length_cm.

Solution

## 1a
penguins %>% 
  mutate(bill_ld_ratio = bill_length_mm / bill_depth_mm)
## # A tibble: 344 x 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
## #   bill_ld_ratio <dbl>
# 1b
penguins %>% 
  mutate(body_mass_g_log = log(body_mass_g))
## # A tibble: 344 x 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
## #   body_mass_g_log <dbl>
## 1c
penguins %>% 
  mutate(flipper_length_cm  = flipper_length_mm/10)
## # A tibble: 344 x 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
## #   flipper_length_cm <dbl>

Add variable based on data conditionals

Sometimes, we want to assign certain data values based on other variables in the data set. For instance, maybe we want to classify all penguins with body mass above 4.5 kg as “large” while the rest are “normal”?

The if_else function takes expressions much like filter. The first value after the expression is the value assigned if the expression is TRUE, while the second is if the expression is FALSE

penguins %>% 
  mutate(size = if_else(body_mass_g > 4500,
                       "large", "normal"))
## # A tibble: 344 x 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>, size <chr>

Now we have a column with two values, “large” and “normal” based on whether the penguins are above or below 4.5 kilos.

We can for instance use that in a plot.

penguins %>% 
  mutate(size = if_else(body_mass_g > 4500,
                       "large", "normal")) %>% 
  ggplot() +
  geom_jitter(aes(x = year,
                  y = body_mass_g,
                  colour = size))
## Warning: Removed 2 rows containing missing values (geom_point).

That shows us clearly that we have grouped the penguins based on their size. But there is this strange NA in the plot legend. what is that?

In R, missing values are usually given the value NA which stands for Not applicable. This is a very special name in R. Like TRUE and FALSE are capitalized, RStudio immediately recognises the combination of capital letters and gives it another colour than all other values.

Let us just explore that a tiny bit.

NA is great, because it can be used in any vector as a missing value placeholder. Remember we mentioned yesterday that a column in a data set must contain all the same type of data. Either factor, or numeric, or string etc. It cannot be a combination. So mssing values must be something a little special, to still fit into all those categories. Let us do a couple of tests.

We will use the c function in R to create new vectors. It stands for either “combine” or “concatenate”.

c("this", "is", "a", "string", "vector" ,NA)
## [1] "this"   "is"     "a"      "string" "vector" NA
c(1:4, NA, 5:8)
## [1]  1  2  3  4 NA  5  6  7  8
factor(c("fct1", "fct2", NA))
## [1] fct1 fct2 <NA>
## Levels: fct1 fct2

Notice how the NA is never quoted, its displayed in plain letters. This lets us know its not stored as a string, but as a special NA missing value. We will come across this NA especially tomorrow when we create summaries.

since its not quoted we can also not look for it like a string, and we also cannot look for it with an equals. NA’s are special, and therefore have their own function to locate them.

penguins %>% 
  filter(body_mass_g == "NA")
## # A tibble: 0 x 8
## # … with 8 variables: species <fct>, island <fct>, bill_length_mm <dbl>,
## #   bill_depth_mm <dbl>, flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
## #   year <int>
penguins %>% 
  filter(body_mass_g == NA)
## # A tibble: 0 x 8
## # … with 8 variables: species <fct>, island <fct>, bill_length_mm <dbl>,
## #   bill_depth_mm <dbl>, flipper_length_mm <int>, body_mass_g <int>, sex <fct>,
## #   year <int>
penguins %>% 
  filter(is.na(body_mass_g))
## # A tibble: 2 x 8
##   species island bill_length_mm bill_depth_mm flipper_length_… body_mass_g sex  
##   <fct>   <fct>           <dbl>         <dbl>            <int>       <int> <fct>
## 1 Adelie  Torge…             NA            NA               NA          NA <NA> 
## 2 Gentoo  Biscoe             NA            NA               NA          NA <NA> 
## # … with 1 more variable: year <int>

Challenge 2.

You can run assignments in your own RStudio, or run the second challenge in the mutating tutorial by entering the following in the R console:

learnr::run_tutorial("003-mutating", "swc.tidyverse")

(helpers, please paste this into the chat at the right time.)

Assignment

2a: Adapt the example to evaluate if body mass is below 4.5kg, and assign rows that are TRUE to be “normal” and rows that are FALSE to “large”

2b: Create a new column named “biscoe” and its content should be TRUE if the island is “Biscoe” and FALSE for everything else.

Solution

## 2a
penguins %>% 
  mutate(body_type = if_else(body_mass_g < 4500, "normal", "large")) 
## # A tibble: 344 x 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
## #   body_type <chr>
# 2b
penguins %>% 
  mutate(biscoe = if_else(island == "Biscoe", TRUE, FALSE)) %>% 
  select(species, island, biscoe)
## # A tibble: 344 x 3
##    species island    biscoe
##    <fct>   <fct>     <lgl> 
##  1 Adelie  Torgersen FALSE 
##  2 Adelie  Torgersen FALSE 
##  3 Adelie  Torgersen FALSE 
##  4 Adelie  Torgersen FALSE 
##  5 Adelie  Torgersen FALSE 
##  6 Adelie  Torgersen FALSE 
##  7 Adelie  Torgersen FALSE 
##  8 Adelie  Torgersen FALSE 
##  9 Adelie  Torgersen FALSE 
## 10 Adelie  Torgersen FALSE 
## # … with 334 more rows

Add variables consecutively

Now we know how to create new variables, and even how to make them if there are conditions on how to add the data.

But, we often want to add several columns of different types, and maybe even add new variables based on other new columns! Oh, its starting to sound complicated, but it does not have to be!

mutate is so-called lazy-evaluated. This sounds weird, but it means that each new column you make is made in the sequence you make them. So as long as you think about the order of your mutate creations, you can do that in a single mutate call.

penguins %>% 
  mutate(
    bill_ratio = bill_depth_mm/bill_length_mm,
    bill_type = if_else(bill_ratio < .5, "elongated", "stumped")
  ) %>% 
  select(starts_with("bill"))
## # A tibble: 344 x 4
##    bill_length_mm bill_depth_mm bill_ratio bill_type
##             <dbl>         <dbl>      <dbl> <chr>    
##  1           39.1          18.7      0.478 elongated
##  2           39.5          17.4      0.441 elongated
##  3           40.3          18        0.447 elongated
##  4           NA            NA       NA     <NA>     
##  5           36.7          19.3      0.526 stumped  
##  6           39.3          20.6      0.524 stumped  
##  7           38.9          17.8      0.458 elongated
##  8           39.2          19.6      0.5   stumped  
##  9           34.1          18.1      0.531 stumped  
## 10           42            20.2      0.481 elongated
## # … with 334 more rows

Now you’ve created two variables. One for bill ratio, and then another one conditional on the values on the bill ratio.

If you switched the order of these two, R would produce and error, because there would be no bill ratio to create the other columnm

penguins %>% 
  mutate(
    bill_type = if_else(bill_ratio < .5, "elongated", "stumped"),
    bill_ratio = bill_depth_mm/bill_length_mm
  ) %>% 
  select(starts_with("bill"))
Error: Problem with `mutate()` input `bill_type`. x object 'bill_ratio' not found ℹ Input `bill_type` is `if_else(bill_ratio > 1, "elongated", "stumped")

But what if we want to categorise based on more than one condition? Nested if_else?

penguins %>% 
  mutate(
    bill_ratio = bill_depth_mm/bill_length_mm,
    bill_type = if_else(bill_ratio < .35, 
                       "elongated", 
                       if_else(bill_ratio < .45, 
                              "normal",
                              "stumped"))
  ) %>% 
  select(starts_with("bill"))
## # A tibble: 344 x 4
##    bill_length_mm bill_depth_mm bill_ratio bill_type
##             <dbl>         <dbl>      <dbl> <chr>    
##  1           39.1          18.7      0.478 stumped  
##  2           39.5          17.4      0.441 normal   
##  3           40.3          18        0.447 normal   
##  4           NA            NA       NA     <NA>     
##  5           36.7          19.3      0.526 stumped  
##  6           39.3          20.6      0.524 stumped  
##  7           38.9          17.8      0.458 stumped  
##  8           39.2          19.6      0.5   stumped  
##  9           34.1          18.1      0.531 stumped  
## 10           42            20.2      0.481 stumped  
## # … with 334 more rows

ok, so it worked we think. Lets plot it to see if we agree.

penguins %>% 
  mutate(
    bill_ratio = bill_depth_mm/bill_length_mm,
    bill_type = if_else(bill_ratio < .35, 
                       "elongated", 
                       if_else(bill_ratio < .45, 
                              "normal",
                              "stumped"))
  ) %>% 
  ggplot(aes(x = bill_length_mm, 
             y = bill_depth_mm, 
             colour = bill_type)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).

that looks ok, but the code is horrible! And what if you have even more conditionals? Thankfully, dplyr has a smarter way of doing this, called case_when. This function is similar to if_else, but where you specify what each condition should be assigned. On the left you have the logical expression, and the on the right of the tilde (~) is the value to be assigned if that expression is TRUE

penguins %>% 
  mutate(
    bill_ratio = bill_depth_mm/bill_length_mm,
    bill_type = case_when(
      bill_ratio < .35 ~ "elongated",
      bill_ratio < .45 ~ "normal",
      TRUE ~ "stumped")
  ) %>% 
  ggplot(aes(x = bill_length_mm, 
             y = bill_depth_mm, 
             colour = bill_type)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).

That looks almost the same. The NA’s are gone! That’s not right. We cannot categorise values that are missing. Its our last statement that does this, which just says “make the remainder this value”. Which is not what we want. We need the NAs to stay NA. case_when like the mutate it self, evaluates the expressions in sequence. Which is why we can have two statements evaluating the same column with similar expressions (below .35 and then below .45). All values that are below .45 are also below .35. Since we first assign everything below .35, and then below .45, they do not collide. We can do the same for our last statement, saying that all values that are not NA should be given this category.

penguins %>% 
  mutate(
    bill_ratio = bill_depth_mm/bill_length_mm,
    bill_type = case_when(
      bill_ratio < .35 ~ "elongated",
      bill_ratio < .45 ~ "normal",
      !is.na(bill_ratio) ~ "stumped")
  ) %>% 
  ggplot(aes(x = bill_length_mm, 
             y = bill_depth_mm, 
             colour = bill_type)) +
  geom_point()
## Warning: Removed 2 rows containing missing values (geom_point).

Here, we use the is.na() function we saw earlier, on bill_ratio. But it also has an ! in front, what does that mean? In R logical expressions, the ! is a negation specifier. It means it flips the logical so the TRUE becomes FALSE, and vice versa. So here, it means there bill_ratio is not NA.

Challenge 3.

You can run assignments in your own RStudio, or run the third challenge in the mutating tutorial by entering the following in the R console:

learnr::run_tutorial("003-mutating", "swc.tidyverse")

(helpers, please paste this into the chat at the right time.)

Assignment

3a: Create a new column called body_type which should contain the categories “large” for penguins above 4 kilos, “petite” penguins below 3 kilos, and “normal” for all the remaining penguins that have their mass recorded.

Solution

## 3a
penguins %>% 
  mutate(
    body_type = case_when(
      body_mass_g > 4500 ~ "large",
      body_mass_g < 3000 ~ "petite",
      !is.na(body_mass_g) ~ "normal") # the rest
  ) 
## # A tibble: 344 x 9
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
## #   body_type <chr>

Adding or altering variables en masse

So far, we’ve been looking at adding variables one by one. This is of course something we do all the time, but some times we need to do the same change to multiple columns at once. Imagine you have a data set with 20 column and you want to scale them all to the same scale. Writing the same command with different columns 20 times is very tedious. It is now the dplyr package truly starts to shine!

In our case, let us say we want to scale the three columns with milimetre measurements so that they have a mean of 0 and standard deviation of 1. We’ve alrady used the scale function once before, so we will do it again.

In this simple example we might have done so:

penguins %>% 
  mutate(
    bill_depth_sc = scale(bill_depth_mm),
    bill_length_sc = scale(bill_length_mm),
    flipper_length_sc = scale(flipper_length_mm),
  )
## # A tibble: 344 x 11
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 5 more variables: sex <fct>, year <int>,
## #   bill_depth_sc <dbl[,1]>, bill_length_sc <dbl[,1]>,
## #   flipper_length_sc <dbl[,1]>

Its just three columns, we can do that.But let us imagine we have 20 of these, typing all that out is tedious and error prone. You might forget to alter the name or keep the same type of naing convention. We are only human, we easily make mistakes.

With dplyr’s across() we can combine our knowlede of tidy-selectors and mutate to create the entire transformation for these columns at once.

penguins %>% 
  mutate(across(ends_with("mm"),
                scale))
## # A tibble: 344 x 8
##    species island bill_length_mm[… bill_depth_mm[,… flipper_length_… body_mass_g
##    <fct>   <fct>             <dbl>            <dbl>            <dbl>       <int>
##  1 Adelie  Torge…           -0.883            0.784           -1.42         3750
##  2 Adelie  Torge…           -0.810            0.126           -1.06         3800
##  3 Adelie  Torge…           -0.663            0.430           -0.421        3250
##  4 Adelie  Torge…           NA               NA               NA              NA
##  5 Adelie  Torge…           -1.32             1.09            -0.563        3450
##  6 Adelie  Torge…           -0.847            1.75            -0.776        3650
##  7 Adelie  Torge…           -0.920            0.329           -1.42         3625
##  8 Adelie  Torge…           -0.865            1.24            -0.421        4675
##  9 Adelie  Torge…           -1.80             0.480           -0.563        3475
## 10 Adelie  Torge…           -0.352            1.54            -0.776        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>

whoa! So fast and so simple! Now the three columns are scaled. But oh no! The columns have been overwritten. Rather than creating new ones, we replaced the old ones.

This might be your intention in some instances, or maybe you will just create a new dataset with the scaled variables.

penguins_mm_sc <- penguins %>% 
  mutate(across(ends_with("mm"),
                scale))

but often, we’d like to keep the original but add the new variants. We can do that to within the across!

penguins %>% 
  mutate(across(ends_with("mm"),
                scale, 
                .names = "{.col}_sc")) %>% 
  select(contains("mm"))
## # A tibble: 344 x 6
##    bill_length_mm bill_depth_mm flipper_length_mm bill_length_mm_sc[,1]
##             <dbl>         <dbl>             <int>                 <dbl>
##  1           39.1          18.7               181                -0.883
##  2           39.5          17.4               186                -0.810
##  3           40.3          18                 195                -0.663
##  4           NA            NA                  NA                NA    
##  5           36.7          19.3               193                -1.32 
##  6           39.3          20.6               190                -0.847
##  7           38.9          17.8               181                -0.920
##  8           39.2          19.6               195                -0.865
##  9           34.1          18.1               193                -1.80 
## 10           42            20.2               190                -0.352
## # … with 334 more rows, and 2 more variables: bill_depth_mm_sc <dbl[,1]>,
## #   flipper_length_mm_sc <dbl[,1]>

now they are all there! neat! But that .names argument is a little weird. what does it really mean?

Internally, across stores the column names in a vector it calls .col. We can use this knowledge to tell the across function what to name our new columns. In this case, we want to append the column name with _sc.

Challenge 4.

You can run assignments in your own RStudio, or run the third challenge in the mutating tutorial by entering the following in the R console:

learnr::run_tutorial("003-mutating", "swc.tidyverse")

(helpers, please paste this into the chat at the right time.)

Assignment

4a: Transform all the colmns with milimetres measurements so they are scaled, and the prefix “sc_” to the columns names. 4b: Do the same, but now only for the bill measurements.

Solution

## 4a
penguins %>% 
  mutate(across(ends_with("mm"),
                scale, 
                .names = "sc_{.col}")) %>% 
  select(contains("mm"))
## # A tibble: 344 x 6
##    bill_length_mm bill_depth_mm flipper_length_mm sc_bill_length_mm[,1]
##             <dbl>         <dbl>             <int>                 <dbl>
##  1           39.1          18.7               181                -0.883
##  2           39.5          17.4               186                -0.810
##  3           40.3          18                 195                -0.663
##  4           NA            NA                  NA                NA    
##  5           36.7          19.3               193                -1.32 
##  6           39.3          20.6               190                -0.847
##  7           38.9          17.8               181                -0.920
##  8           39.2          19.6               195                -0.865
##  9           34.1          18.1               193                -1.80 
## 10           42            20.2               190                -0.352
## # … with 334 more rows, and 2 more variables: sc_bill_depth_mm <dbl[,1]>,
## #   sc_flipper_length_mm <dbl[,1]>
## 4b
penguins %>% 
  mutate(across(starts_with("mm"),
                scale, 
                .names = "sc_{.col}")) %>% 
  select(starts_with("mm"))
## # A tibble: 344 x 0
## 4c
penguins %>% 
  mutate(across(where(is.numeric),
                scale, 
                .names = "sc_{.col}")) %>% 
  select(where(is.numeric))
## # A tibble: 344 x 10
##    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##             <dbl>         <dbl>             <int>       <int> <int>
##  1           39.1          18.7               181        3750  2007
##  2           39.5          17.4               186        3800  2007
##  3           40.3          18                 195        3250  2007
##  4           NA            NA                  NA          NA  2007
##  5           36.7          19.3               193        3450  2007
##  6           39.3          20.6               190        3650  2007
##  7           38.9          17.8               181        3625  2007
##  8           39.2          19.6               195        4675  2007
##  9           34.1          18.1               193        3475  2007
## 10           42            20.2               190        4250  2007
## # … with 334 more rows, and 5 more variables: sc_bill_length_mm <dbl[,1]>,
## #   sc_bill_depth_mm <dbl[,1]>, sc_flipper_length_mm <dbl[,1]>,
## #   sc_body_mass_g <dbl[,1]>, sc_year <dbl[,1]>

Adding variables based on grouped data

Sometimes, it makes sense to calculate values based on some grouping variable. In this case, for instance species, island or sex. For instance,

With dplyr, we can utilise something called data grouping, to achieve what we are after. We can group the data.

When data is grouped by one or more columns in the data, one can apply calculations based on summary measures for the groups on each individual score. This is powerful when you want to calculate which percentile a score falls in, or other relational measures (like time since baseline).

penguins %>% 
  ggplot(aes(x = bill_length_mm,
             fill = species)) +
  geom_density(alpha = .5)
## Warning: Removed 2 rows containing non-finite values (stat_density).

Lets try that by calculating which percentile a measurement of bill length is within its species. We start by grouping the data, and looking at what might have changed.

penguins %>% 
  group_by(species)
## # A tibble: 344 x 8
## # Groups:   species [3]
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen           39.1          18.7               181        3750
##  2 Adelie  Torgersen           39.5          17.4               186        3800
##  3 Adelie  Torgersen           40.3          18                 195        3250
##  4 Adelie  Torgersen           NA            NA                  NA          NA
##  5 Adelie  Torgersen           36.7          19.3               193        3450
##  6 Adelie  Torgersen           39.3          20.6               190        3650
##  7 Adelie  Torgersen           38.9          17.8               181        3625
##  8 Adelie  Torgersen           39.2          19.6               195        4675
##  9 Adelie  Torgersen           34.1          18.1               193        3475
## 10 Adelie  Torgersen           42            20.2               190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>

This look very similar, but notice at the top, it now says Groups: species (3). This is dplyr’s way of telling us the data is grouped by the species column, and that there are three groups. We know that should be right since there are three species in the data set.

Now, let us calculate the species maximum value of bill length. And then use this maximum value to find where the bill length measurements fall relative to that.

penguins %>% 
  group_by(species) %>% 
  mutate(
    bill_length_sp_max = max(bill_length_mm, na.rm = TRUE),
    bill_length_rel = bill_length_mm/bill_length_sp_max
  ) %>% 
  select(species, island, starts_with("bill_length"))
## # A tibble: 344 x 5
## # Groups:   species [3]
##    species island    bill_length_mm bill_length_sp_max bill_length_rel
##    <fct>   <fct>              <dbl>              <dbl>           <dbl>
##  1 Adelie  Torgersen           39.1                 46           0.85 
##  2 Adelie  Torgersen           39.5                 46           0.859
##  3 Adelie  Torgersen           40.3                 46           0.876
##  4 Adelie  Torgersen           NA                   46          NA    
##  5 Adelie  Torgersen           36.7                 46           0.798
##  6 Adelie  Torgersen           39.3                 46           0.854
##  7 Adelie  Torgersen           38.9                 46           0.846
##  8 Adelie  Torgersen           39.2                 46           0.852
##  9 Adelie  Torgersen           34.1                 46           0.741
## 10 Adelie  Torgersen           42                   46           0.913
## # … with 334 more rows

What are we looking at here? Our new column for species maximum seems to have just one value in it. Isnt that odd? On the top here, we are only look at “Adelie” penguins, so that is a single species, and they should all have the same value. We should check the others.

penguins %>% 
  group_by(species) %>% 
  mutate(
    bill_length_sp_max = max(bill_length_mm, na.rm = TRUE),
    bill_length_rel = bill_length_mm/bill_length_sp_max
  ) %>% 
  select(species, bill_length_sp_max) %>% 
  distinct()
## # A tibble: 3 x 2
## # Groups:   species [3]
##   species   bill_length_sp_max
##   <fct>                  <dbl>
## 1 Adelie                  46  
## 2 Gentoo                  59.6
## 3 Chinstrap               58

Here, we selected only the species and the species maximum column, and used a function called distinct() which removes all rows that are duplicates. In this case, it leaves us with three rows. One for each species with a value for each.

Now that we have confirmed that there are different values for the different species, lets look at the rest in a plot.

penguins %>% 
  group_by(species) %>% 
  mutate(
    bill_length_sp_max = max(bill_length_mm, na.rm = TRUE),
    bill_length_rel = bill_length_mm/bill_length_sp_max
  ) %>% 
  
    ggplot(aes(x = bill_length_rel,
             fill = species)) +
  geom_density(alpha = .5)
## Warning: Removed 2 rows containing non-finite values (stat_density).

Now we can see in our plot, that we have created columns as expected. We can now see that the distributions have moved, while still maintaining their shape. So they are now relative to each species.

And we can even do that in across! Maybe you want to figure out how bill length and depth relate to each other regardless of which species of penguins. You’d somehow need to de-mean the data based on species, before doing analyses.

Let us plot the data and see what things look like to begin with.

penguins %>% 
  ggplot(aes(x = bill_length_mm,
             y = bill_depth_mm,
             colour = species)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

This looks quite similar between the species. But they are on slightly different scales, and we’d like an even clearer comparison basis.

penguins %>% 
  group_by(species) %>% 
  mutate(across(starts_with("bill"),
                ~ .x - mean(.x, na.rm = TRUE)))
## # A tibble: 344 x 8
## # Groups:   species [3]
##    species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
##    <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
##  1 Adelie  Torgersen          0.309         0.354               181        3750
##  2 Adelie  Torgersen          0.709        -0.946               186        3800
##  3 Adelie  Torgersen          1.51         -0.346               195        3250
##  4 Adelie  Torgersen         NA            NA                    NA          NA
##  5 Adelie  Torgersen         -2.09          0.954               193        3450
##  6 Adelie  Torgersen          0.509         2.25                190        3650
##  7 Adelie  Torgersen          0.109        -0.546               181        3625
##  8 Adelie  Torgersen          0.409         1.25                195        4675
##  9 Adelie  Torgersen         -4.69         -0.246               193        3475
## 10 Adelie  Torgersen          3.21          1.85                190        4250
## # … with 334 more rows, and 2 more variables: sex <fct>, year <int>

Break it down. First we group the data, then we mutate across all columns starting with “bill”, and applying a strange function.

Remember how when you renamed the columns you learned about the across functions internal name for the column .col. Across also stores the input vector in the same way in a vector called .x. So when we create our own specification of what we want across to do (rather than applying a function like scale), we use the tilde ~ which allows us to use this internal .x vector to specify our wanted behaviour. When we demean data, we take the value and subtract the mean of the entire vector from that value. That means values larger than the mean stay positive, while values smaller than the mean become negative.

Let us plot it to see if it did what we expected.

penguins %>% 
  group_by(species) %>% 
  mutate(across(starts_with("bill"),
                ~ .x - mean(.x, na.rm = TRUE))) %>% 
    ggplot(aes(x = bill_length_mm,
             y = bill_depth_mm,
             colour = species)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 2 rows containing non-finite values (stat_smooth).
## Warning: Removed 2 rows containing missing values (geom_point).

hurrah! That looks right to me. Now all values are around 0, and we can clearly see that the relationship between bill length and depth is pretty much the same between the species.

Challenge 5.

You can run assignments in your own RStudio, or run the third challenge in the mutating tutorial by entering the following in the R console:

learnr::run_tutorial("003-mutating", "swc.tidyverse")

(helpers, please paste this into the chat at the right time.)

Assignment

5a: De-mean the bill length column. Do it in two steps, first by making a column for the species means, then using that data to de-mean the bill length.

5b: Do the same, but now in a single step (i.e. do not store the species mean in its own column)

5c: Based on the code in the previous example, adapt it to be grouped by island in stead of species.

5d: De-mean all the numerical columns, and give them a suffix of “dm”.

Solution

## 5a
penguins %>% 
  group_by(species) %>% 
  mutate(
    bill_length_sp_mean = mean(bill_length_mm, na.rm = TRUE),
    bill_length_cent = bill_length_mm - bill_length_sp_mean
  ) %>% 
  select(species, island, starts_with("bill"))
## # A tibble: 344 x 6
## # Groups:   species [3]
##    species island bill_length_mm bill_depth_mm bill_length_sp_… bill_length_cent
##    <fct>   <fct>           <dbl>         <dbl>            <dbl>            <dbl>
##  1 Adelie  Torge…           39.1          18.7             38.8            0.309
##  2 Adelie  Torge…           39.5          17.4             38.8            0.709
##  3 Adelie  Torge…           40.3          18               38.8            1.51 
##  4 Adelie  Torge…           NA            NA               38.8           NA    
##  5 Adelie  Torge…           36.7          19.3             38.8           -2.09 
##  6 Adelie  Torge…           39.3          20.6             38.8            0.509
##  7 Adelie  Torge…           38.9          17.8             38.8            0.109
##  8 Adelie  Torge…           39.2          19.6             38.8            0.409
##  9 Adelie  Torge…           34.1          18.1             38.8           -4.69 
## 10 Adelie  Torge…           42            20.2             38.8            3.21 
## # … with 334 more rows
## 5b
penguins %>% 
  group_by(species) %>% 
  mutate(
    bill_length_cent = bill_length_mm - mean(bill_length_mm, na.rm = TRUE)
  ) %>% 
  select(species, island, starts_with("bill"))
## # A tibble: 344 x 5
## # Groups:   species [3]
##    species island    bill_length_mm bill_depth_mm bill_length_cent
##    <fct>   <fct>              <dbl>         <dbl>            <dbl>
##  1 Adelie  Torgersen           39.1          18.7            0.309
##  2 Adelie  Torgersen           39.5          17.4            0.709
##  3 Adelie  Torgersen           40.3          18              1.51 
##  4 Adelie  Torgersen           NA            NA             NA    
##  5 Adelie  Torgersen           36.7          19.3           -2.09 
##  6 Adelie  Torgersen           39.3          20.6            0.509
##  7 Adelie  Torgersen           38.9          17.8            0.109
##  8 Adelie  Torgersen           39.2          19.6            0.409
##  9 Adelie  Torgersen           34.1          18.1           -4.69 
## 10 Adelie  Torgersen           42            20.2            3.21 
## # … with 334 more rows
## 5c
penguins %>% 
  group_by(species) %>% 
  mutate(
    across(where(is.numeric),
           ~ .x - mean(.x, na.rm = TRUE),
           .names = "{.col}_dm")
  ) %>% 
  select(species, where(is.numeric))
## # A tibble: 344 x 11
## # Groups:   species [3]
##    species bill_length_mm bill_depth_mm flipper_length_mm body_mass_g  year
##    <fct>            <dbl>         <dbl>             <int>       <int> <int>
##  1 Adelie            39.1          18.7               181        3750  2007
##  2 Adelie            39.5          17.4               186        3800  2007
##  3 Adelie            40.3          18                 195        3250  2007
##  4 Adelie            NA            NA                  NA          NA  2007
##  5 Adelie            36.7          19.3               193        3450  2007
##  6 Adelie            39.3          20.6               190        3650  2007
##  7 Adelie            38.9          17.8               181        3625  2007
##  8 Adelie            39.2          19.6               195        4675  2007
##  9 Adelie            34.1          18.1               193        3475  2007
## 10 Adelie            42            20.2               190        4250  2007
## # … with 334 more rows, and 5 more variables: bill_length_mm_dm <dbl>,
## #   bill_depth_mm_dm <dbl>, flipper_length_mm_dm <dbl>, body_mass_g_dm <dbl>,
## #   year_dm <dbl>

Wrap up

Now we’ve learned a little about adding and altering variables in data sets using dplyr’s mutate function. Both alone, and also together with across and group_by. You should be able to play around with the examples provided and learn more about how things work through trial and error.